Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the various fees they carry: annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers leaving the credit card service would cost the bank revenue, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.
You need to identify the best possible model that will give the required performance.
Objective
Explore and visualize the dataset. Build a classification model to predict whether the customer is going to churn. Optimize the model using appropriate techniques. Generate a set of insights and recommendations that will help the bank.
Data Dictionary:
Model Performances
Productionize the model
Actionable Insights & Recommendations
Read the given data into a data frame and understand its nature: the features, the total number of records, and whether it has missing values, duplicate rows, or outliers.
Visualize the data to understand its range and outliers.
Load all standard Python library packages.
# this will help in making the Python code more structured automatically
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import scipy.stats as stats
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
Read the given CSV file BankChurners.csv into the data frame bank_data, then work on a copy named data.
# read the CSV data provided by the bank into a data frame
bank_data = pd.read_csv("BankChurners.csv")
# copy the original data so that later changes do not overwrite it
data = bank_data.copy()
data.head(5)
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
data.tail(5)
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
The data has a few categorical features; most features are numerical.
data.shape
(10127, 21)
observations on data
checking data types of all columns
CLIENTNUM is a client identifier with no relation to the other features, so we can drop this column.
# Drop CLIENTNUM Columns
data.drop("CLIENTNUM", axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           10127 non-null  object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
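Dropping CLIENTNUM assumes it is a pure per-row identifier. A minimal sketch of how that assumption could be checked before dropping, using a small made-up frame in place of BankChurners.csv; the helper name `is_row_identifier` is hypothetical and not part of the notebook:

```python
import pandas as pd


def is_row_identifier(df: pd.DataFrame, column: str) -> bool:
    """Return True when `column` has no missing and no repeated values."""
    return bool(df[column].notna().all() and df[column].is_unique)


# Toy data standing in for the real file (illustrative values only).
toy = pd.DataFrame(
    {
        "CLIENTNUM": [768805383, 818770008, 713982108],
        "Customer_Age": [45, 49, 51],
    }
)

if is_row_identifier(toy, "CLIENTNUM"):
    # A unique ID carries no predictive signal, so it can be dropped.
    toy = toy.drop(columns="CLIENTNUM")

print(toy.columns.tolist())
```

If the check fails (repeated or missing IDs), that would itself be worth investigating before the duplicate check, since duplicates are searched for after the ID column is gone.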
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
data.describe(include="object").T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
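Observations on data summary
Attrition_Flag: Target feature - 2 values. Existing Customer is the majority class with 8500 records; the remaining 1627 are Attrited Customer.
Gender: 5358 customers are Female and the remaining 4769 are Male.
Dependent_count: Ranges from 0 to 5.
Education_Level: 6 different values; Graduate is the most common with 3128 records.
Marital_Status: 3 possible values; Married is the most common.
Income_Category: Annual income category of the account holder - 6 possible values; Less than $40K is the most common.
Card_Category: Type of card - 4 different values, with Blue being the majority at 9436 records.
Months_on_book: Ranges from 13 to 56, with a mean of about 36.
Total_Relationship_Count: Ranges from 1 to 6, with a mean of almost 4.
Months_Inactive_12_mon: Ranges from 0 to 6, with a mean of around 2.
Contacts_Count_12_mon: Ranges from 0 to 6, with a mean of about 2.
Credit_Limit: Ranges from 1438.3 to 34516.0 with possible outliers; the mean is around 8631.95.
Total_Revolving_Bal: Ranges from 0 to 2517; the value distribution needs to be checked for outliers.
Avg_Open_To_Buy: Ranges from 3.0 to 34516.0 and appears to have outliers.
Total_Trans_Amt: Ranges from 510.0 to 18484.0 and appears to have outliers.
Total_Trans_Ct: Ranges from 10.0 to 139.0 and appears to have outliers.
Total_Ct_Chng_Q4_Q1: Ranges from 0 to 3.7 and appears to have outliers.
Total_Amt_Chng_Q4_Q1: Ranges from 0 to about 3.4 and appears to have outliers.
Avg_Utilization_Ratio: Ranges from 0 to 0.99 and appears to have outliers.
Let's check for any duplicate values.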
# check for any duplicate data
data[data.duplicated()]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|
No duplicate rows found; no action required.
Let's check which columns have null values, and how many.
# Returns the null value count and fraction for every column that has missing values
def print_null_info(df):
    """
    Returns total null value count(s) & fraction for all columns in the input
    data frame that have at least one missing value
    (the "missing %" column holds a fraction between 0 and 1)
    """
    nullInfo = {"missing count": df.isnull().sum(), "missing %": df.isnull().mean()}
    # Creates pandas DataFrame.
    nulldata = pd.DataFrame(nullInfo)
    return nulldata[nulldata["missing count"] > 0].sort_values(
        by="missing count", ascending=False
    )
# Prints unique value counts, top 10 value & count(s) for all category columns in input data frame
def print_category_value_counts(df, column_names):
    """
    Prints unique value counts, top 10 value & count(s) for all category columns in input data frame
    """
    print()
    for col in column_names:
        print()
        print(f"Column name : {col} has total {df[col].nunique()} unique values")
        print()
        # value_counts() sorts by frequency, so head(10) gives the top 10 values
        print(df[col].value_counts().head(10))
        print()
        print("-" * 50)
print_null_info(data)
| | missing count | missing % |
|---|---|---|
| Education_Level | 1519 | 0.149995 |
| Marital_Status | 749 | 0.073961 |
Observations on missing data
We don't want to delete rows with missing values. We have to treat these missing values so that we don't lose that data.
# counting the number of missing values per row
num_missing = data.isnull().sum(axis=1)
num_missing.value_counts()
0    7973
1    2040
2     114
dtype: int64
observations on data missing by row
data[num_missing == 2].sample(n=15)
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3701 | Existing Customer | 43 | M | 3 | NaN | NaN | $80K - $120K | Blue | 38 | 6 | 2 | 3 | 20348.0 | 1732 | 18616.0 | 0.674 | 1624 | 27 | 0.688 | 0.085 |
| 8166 | Existing Customer | 50 | M | 2 | NaN | NaN | $120K + | Blue | 32 | 2 | 3 | 1 | 25645.0 | 1083 | 24562.0 | 0.757 | 4302 | 63 | 0.500 | 0.042 |
| 9332 | Existing Customer | 56 | M | 1 | NaN | NaN | $60K - $80K | Blue | 43 | 2 | 2 | 3 | 10602.0 | 1083 | 9519.0 | 0.656 | 14084 | 91 | 0.750 | 0.102 |
| 6406 | Existing Customer | 44 | M | 1 | NaN | NaN | $80K - $120K | Blue | 29 | 6 | 1 | 2 | 8315.0 | 1008 | 7307.0 | 0.718 | 4088 | 66 | 0.610 | 0.121 |
| 6532 | Existing Customer | 43 | F | 2 | NaN | NaN | Less than $40K | Blue | 36 | 4 | 3 | 3 | 1780.0 | 1541 | 239.0 | 0.839 | 5008 | 93 | 0.722 | 0.866 |
| 3770 | Attrited Customer | 38 | M | 2 | NaN | NaN | $60K - $80K | Blue | 26 | 2 | 2 | 3 | 11252.0 | 1129 | 10123.0 | 0.462 | 1766 | 31 | 0.476 | 0.100 |
| 4558 | Existing Customer | 48 | F | 4 | NaN | NaN | Less than $40K | Blue | 29 | 6 | 3 | 4 | 3212.0 | 2517 | 695.0 | 0.785 | 4284 | 69 | 1.300 | 0.784 |
| 2928 | Existing Customer | 43 | F | 4 | NaN | NaN | abc | Blue | 34 | 5 | 3 | 3 | 12778.0 | 1528 | 11250.0 | 0.777 | 3337 | 69 | 0.769 | 0.120 |
| 4840 | Existing Customer | 51 | F | 2 | NaN | NaN | $40K - $60K | Blue | 44 | 4 | 6 | 4 | 5489.0 | 1665 | 3824.0 | 0.742 | 4010 | 75 | 0.786 | 0.303 |
| 6312 | Attrited Customer | 46 | F | 3 | NaN | NaN | Less than $40K | Blue | 41 | 5 | 3 | 2 | 4710.0 | 0 | 4710.0 | 0.717 | 2690 | 43 | 0.955 | 0.000 |
| 3080 | Existing Customer | 51 | M | 2 | NaN | NaN | $80K - $120K | Blue | 39 | 6 | 1 | 3 | 9854.0 | 1886 | 7968.0 | 0.921 | 3912 | 76 | 0.949 | 0.191 |
| 726 | Existing Customer | 42 | M | 5 | NaN | NaN | $120K + | Blue | 36 | 4 | 1 | 2 | 34516.0 | 1839 | 32677.0 | 0.690 | 1230 | 34 | 0.889 | 0.053 |
| 5322 | Existing Customer | 48 | M | 2 | NaN | NaN | $40K - $60K | Blue | 43 | 5 | 2 | 2 | 1761.0 | 1249 | 512.0 | 0.936 | 4410 | 73 | 0.622 | 0.709 |
| 9772 | Existing Customer | 30 | M | 1 | NaN | NaN | Less than $40K | Blue | 13 | 1 | 3 | 1 | 3789.0 | 1782 | 2007.0 | 0.850 | 15670 | 116 | 0.785 | 0.470 |
| 6732 | Existing Customer | 63 | M | 0 | NaN | NaN | $60K - $80K | Blue | 46 | 6 | 2 | 2 | 23925.0 | 1494 | 22431.0 | 0.704 | 3677 | 55 | 0.719 | 0.062 |
data[num_missing == 1].sample(n=15)
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5630 | Existing Customer | 44 | F | 1 | Graduate | NaN | Less than $40K | Blue | 32 | 3 | 3 | 3 | 1780.0 | 1338 | 442.0 | 0.966 | 4268 | 87 | 0.673 | 0.752 |
| 5997 | Existing Customer | 52 | F | 3 | NaN | Single | Less than $40K | Blue | 37 | 4 | 2 | 2 | 1554.0 | 926 | 628.0 | 0.785 | 4997 | 79 | 0.927 | 0.596 |
| 5673 | Attrited Customer | 31 | F | 1 | NaN | Single | Less than $40K | Blue | 20 | 3 | 1 | 2 | 1695.0 | 0 | 1695.0 | 0.561 | 2428 | 43 | 0.593 | 0.000 |
| 4909 | Existing Customer | 49 | F | 2 | NaN | Single | abc | Blue | 39 | 4 | 1 | 3 | 11320.0 | 0 | 11320.0 | 0.669 | 2845 | 57 | 0.727 | 0.000 |
| 4278 | Existing Customer | 44 | M | 4 | NaN | Single | $80K - $120K | Blue | 38 | 5 | 3 | 4 | 3936.0 | 2024 | 1912.0 | 0.988 | 5001 | 66 | 0.692 | 0.514 |
| 3171 | Attrited Customer | 41 | F | 3 | NaN | Single | $40K - $60K | Blue | 36 | 3 | 2 | 4 | 5317.0 | 0 | 5317.0 | 0.699 | 2003 | 29 | 0.318 | 0.000 |
| 3721 | Existing Customer | 44 | F | 4 | NaN | Married | abc | Blue | 37 | 4 | 3 | 3 | 5826.0 | 0 | 5826.0 | 0.689 | 3756 | 73 | 0.921 | 0.000 |
| 4989 | Existing Customer | 49 | M | 3 | NaN | Married | $80K - $120K | Blue | 43 | 6 | 1 | 1 | 2191.0 | 1414 | 777.0 | 0.635 | 3967 | 76 | 0.767 | 0.645 |
| 657 | Attrited Customer | 48 | F | 2 | NaN | Married | Less than $40K | Blue | 36 | 5 | 1 | 3 | 7151.0 | 800 | 6351.0 | 0.986 | 854 | 22 | 0.375 | 0.112 |
| 4951 | Attrited Customer | 51 | F | 2 | NaN | Married | Less than $40K | Blue | 36 | 1 | 2 | 2 | 2114.0 | 0 | 2114.0 | 0.875 | 2872 | 45 | 0.500 | 0.000 |
| 7502 | Existing Customer | 49 | F | 4 | NaN | Single | $40K - $60K | Blue | 39 | 6 | 2 | 3 | 7382.0 | 1326 | 6056.0 | 0.806 | 4792 | 65 | 0.667 | 0.180 |
| 9189 | Existing Customer | 52 | M | 2 | Uneducated | NaN | $120K + | Blue | 42 | 1 | 1 | 3 | 33552.0 | 1252 | 32300.0 | 0.793 | 14490 | 98 | 0.815 | 0.037 |
| 6257 | Existing Customer | 40 | F | 4 | NaN | Married | $40K - $60K | Blue | 30 | 5 | 2 | 2 | 2166.0 | 1465 | 701.0 | 0.857 | 5097 | 82 | 0.673 | 0.676 |
| 9461 | Attrited Customer | 34 | M | 3 | High School | NaN | $80K - $120K | Silver | 21 | 1 | 2 | 3 | 34516.0 | 0 | 34516.0 | 0.794 | 9177 | 50 | 1.083 | 0.000 |
| 4541 | Attrited Customer | 42 | M | 5 | NaN | Married | $60K - $80K | Blue | 36 | 2 | 3 | 2 | 4963.0 | 0 | 4963.0 | 0.637 | 2335 | 40 | 0.429 | 0.000 |
At times, the missing information is valuable in itself, and imputing it with the most common class would not be appropriate. In such cases, we can replace the missing values with a label like “Unknown” or “Missing” using the fillna() method.
Since a sizeable share of values is missing (about 15% of Education_Level and 7% of Marital_Status), replacing them with the most frequent value may distort the balance of categories in those features.
# replacing missing values with the "Unknown" label for initial analysis
data.fillna("Unknown", inplace=True)
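Calling `fillna("Unknown")` on the whole frame works here because only the two categorical columns have gaps. A minimal sketch of a narrower variant that fills only named categorical columns, so a numeric column with gaps would not silently receive strings; the toy frame and the `cat_cols` list are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both categorical and numeric columns.
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "High School"],
        "Marital_Status": [np.nan, "Married", "Single"],
        "Credit_Limit": [12691.0, np.nan, 3418.0],
    }
)

# Fill only the categorical columns; numeric gaps are left untouched
# so they can be handled separately (for example with a median).
cat_cols = ["Education_Level", "Marital_Status"]
toy[cat_cols] = toy[cat_cols].fillna("Unknown")

print(toy)
```

The numeric gap in `Credit_Limit` stays as NaN, which keeps the column's float dtype intact.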
category_cols = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
print_category_value_counts(data, category_cols)
Column name : Attrition_Flag has total 2 unique values
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
--------------------------------------------------
Column name : Gender has total 2 unique values
F    5358
M    4769
Name: Gender, dtype: int64
--------------------------------------------------
Column name : Education_Level has total 7 unique values
Graduate         3128
High School      2013
Unknown          1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
--------------------------------------------------
Column name : Marital_Status has total 4 unique values
Married     4687
Single      3943
Unknown      749
Divorced     748
Name: Marital_Status, dtype: int64
--------------------------------------------------
Column name : Income_Category has total 6 unique values
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
--------------------------------------------------
Column name : Card_Category has total 4 unique values
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
--------------------------------------------------
Visualize all features before any data clean up and understand what data needs cleaning and fixing.
Univariate analysis helps to check data skewness and possible outliers and spread of the data. Bivariate analysis helps to check data relation between two features.
Create helper functions for the charts: a combined histogram and boxplot, a bar chart labeled with counts or percentages, a stacked bar chart, and a joint plot.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # pass bins only when given, so seaborn uses its default binning otherwise
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # place the legend outside the axes, to the right of the plot
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# this method generates a joint plot of the x feature vs the y feature
def generate_bivariate_chart(data, xfeature, yfeature, hue=None):
    """
    common method to generate joint plot for various columns
    hue param is optional
    """
    sns.set_style("darkgrid")
    print(f"Generating Charts for feature x : {xfeature}, y : {yfeature}")
    sns.jointplot(
        data=data,
        x=xfeature,
        y=yfeature,
        palette="winter",
        height=10,
        kind="scatter",
        hue=hue,
    )
# method that takes a column name and prints all values lying more than limit * IQR away from the median
def check_outlier_using_IQR(column, limit=3):
    """
    Flags values of data[column] whose distance from the median exceeds
    limit * IQR and prints them (a sample of 10 when there are more than 10).
    """
    quartiles = np.quantile(data[column][data[column].notnull()], [0.25, 0.75])
    limit_iqr = limit * (quartiles[1] - quartiles[0])
    outlier = data.loc[np.abs(data[column] - data[column].median()) > limit_iqr, column]
    print()
    print(f"Column : {column} Outlier(s) check")
    print(
        f"Mean : {data[column].mean()}, Median : {data[column].median()}, "
        f"Min : {data[column].min()}, Max : {data[column].max()}"
    )
    print(
        f"Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, {limit}*IQR = {limit_iqr}, "
        f"Total Outlier(s) : {outlier.size} \n"
    )
    if outlier.size > 10:
        print("listing 10 sample outliers")
        print(outlier.sample(10))
    else:
        print("listing all outliers")
        print(outlier)
    print("-" * 50)
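Note that check_outlier_using_IQR measures distance from the median. The more common convention, Tukey's fences, measures from the quartiles instead. A minimal sketch of that alternative for comparison only; it is not the rule this notebook applies, and the name `tukey_outliers` is hypothetical:

```python
import numpy as np
import pandas as pd


def tukey_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = series.dropna()
    q1, q3 = np.quantile(clean, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Toy values with one obviously extreme point.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])
print(tukey_outliers(values).tolist())
```

With a skewed column the two rules can disagree noticeably, because the median-based rule uses a symmetric window while the fences follow the quartiles.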
# Observations on Customer_age
histogram_boxplot(data, "Customer_Age")
# checking outliers range
check_outlier_using_IQR("Customer_Age", 3)
Column : Customer_Age Outlier(s) check
Mean : 46.32596030413745, Median : 46.0, Min : 26, Max : 73
Q1 = 41.0, Q3 = 52.0, 3*IQR = 33.0, Total Outlier(s) : 0
listing all outliers
Series([], Name: Customer_Age, dtype: int64)
--------------------------------------------------
# Observations on Dependent_count
histogram_boxplot(data, "Dependent_count")
# checking outliers range
check_outlier_using_IQR("Dependent_count", 3)
Column : Dependent_count Outlier(s) check
Mean : 2.3462032191172115, Median : 2.0, Min : 0, Max : 5
Q1 = 1.0, Q3 = 3.0, 3*IQR = 6.0, Total Outlier(s) : 0
listing all outliers
Series([], Name: Dependent_count, dtype: int64)
--------------------------------------------------
# Observations on Months_on_book
histogram_boxplot(data, "Months_on_book")
# checking outliers range
check_outlier_using_IQR("Months_on_book", 3)
Column : Months_on_book Outlier(s) check
Mean : 35.928409203120374, Median : 36.0, Min : 13, Max : 56
Q1 = 31.0, Q3 = 40.0, 3*IQR = 27.0, Total Outlier(s) : 0
listing all outliers
Series([], Name: Months_on_book, dtype: int64)
--------------------------------------------------
# Observations on Total_Relationship_Count
histogram_boxplot(data, "Total_Relationship_Count")
# checking outliers range
check_outlier_using_IQR("Total_Relationship_Count", 3)
Column : Total_Relationship_Count Outlier(s) check
Mean : 3.8125802310654686, Median : 4.0, Min : 1, Max : 6
Q1 = 3.0, Q3 = 5.0, 3*IQR = 6.0, Total Outlier(s) : 0
listing all outliers
Series([], Name: Total_Relationship_Count, dtype: int64)
--------------------------------------------------
# Observations on Months_Inactive_12_mon
histogram_boxplot(data, "Months_Inactive_12_mon")
# checking outliers range
check_outlier_using_IQR("Months_Inactive_12_mon", 3)
Column : Months_Inactive_12_mon Outlier(s) check
Mean : 2.3411671768539546, Median : 2.0, Min : 0, Max : 6
Q1 = 2.0, Q3 = 3.0, 3*IQR = 3.0, Total Outlier(s) : 124
listing 10 sample outliers
12      6
2111    6
4621    6
477     6
8540    6
4594    6
6535    6
8200    6
2104    6
6105    6
Name: Months_Inactive_12_mon, dtype: int64
--------------------------------------------------
# Observations on Contacts_Count_12_mon
histogram_boxplot(data, "Contacts_Count_12_mon")
# checking outliers range
check_outlier_using_IQR("Contacts_Count_12_mon", 3)
Column : Contacts_Count_12_mon Outlier(s) check
Mean : 2.4553174681544387, Median : 2.0, Min : 0, Max : 6
Q1 = 2.0, Q3 = 3.0, 3*IQR = 3.0, Total Outlier(s) : 54
listing 10 sample outliers
6801    6
9809    6
3049    6
4189    6
9655    6
5117    6
3157    6
3851    6
7223    6
9212    6
Name: Contacts_Count_12_mon, dtype: int64
--------------------------------------------------
# Observations on Credit_Limit
histogram_boxplot(data, "Credit_Limit")
# checking outliers range
check_outlier_using_IQR("Credit_Limit", 3)
Column : Credit_Limit Outlier(s) check
Mean : 8631.953698034848, Median : 4549.0, Min : 1438.3, Max : 34516.0
Q1 = 2555.0, Q3 = 11067.5, 3*IQR = 25537.5, Total Outlier(s) : 664
listing 10 sample outliers
9847    34516.0
480     34516.0
3867    34516.0
8113    34516.0
2604    33996.0
9967    33905.0
1285    34516.0
2010    31762.0
2916    34516.0
1025    34516.0
Name: Credit_Limit, dtype: float64
--------------------------------------------------
data[data["Credit_Limit"] > 26000].sample(25)
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6411 | Attrited Customer | 44 | F | 2 | High School | Unknown | abc | Gold | 35 | 3 | 3 | 3 | 34516.0 | 0 | 34516.0 | 0.767 | 2227 | 44 | 0.630 | 0.000 |
| 2652 | Existing Customer | 38 | M | 2 | Post-Graduate | Single | $60K - $80K | Silver | 36 | 3 | 3 | 4 | 34516.0 | 1954 | 32562.0 | 0.436 | 1872 | 52 | 0.486 | 0.057 |
| 4140 | Existing Customer | 45 | F | 2 | High School | Single | abc | Blue | 37 | 4 | 2 | 4 | 27804.0 | 0 | 27804.0 | 1.301 | 2761 | 57 | 0.966 | 0.000 |
| 9630 | Existing Customer | 36 | M | 2 | High School | Married | $120K + | Blue | 36 | 3 | 1 | 2 | 34516.0 | 939 | 33577.0 | 0.826 | 15552 | 107 | 0.754 | 0.027 |
| 8768 | Existing Customer | 50 | M | 1 | Unknown | Unknown | $120K + | Silver | 45 | 2 | 2 | 0 | 34516.0 | 1514 | 33002.0 | 0.736 | 7333 | 72 | 1.000 | 0.044 |
| 8739 | Existing Customer | 60 | F | 1 | Doctorate | Single | abc | Silver | 48 | 1 | 2 | 1 | 34516.0 | 1542 | 32974.0 | 0.603 | 7443 | 98 | 0.556 | 0.045 |
| 9113 | Existing Customer | 45 | M | 2 | Graduate | Married | $60K - $80K | Platinum | 31 | 2 | 2 | 1 | 34516.0 | 1308 | 33208.0 | 0.746 | 8773 | 105 | 0.780 | 0.038 |
| 9764 | Existing Customer | 31 | F | 0 | College | Married | abc | Silver | 24 | 2 | 3 | 1 | 34516.0 | 2032 | 32484.0 | 0.820 | 14544 | 104 | 0.705 | 0.059 |
| 490 | Existing Customer | 48 | M | 3 | High School | Married | $80K - $120K | Blue | 36 | 5 | 1 | 3 | 30579.0 | 1351 | 29228.0 | 0.759 | 1289 | 19 | 0.727 | 0.044 |
| 5020 | Existing Customer | 50 | M | 2 | Unknown | Married | $120K + | Gold | 35 | 4 | 5 | 3 | 34516.0 | 668 | 33848.0 | 0.894 | 3931 | 77 | 0.791 | 0.019 |
| 4165 | Existing Customer | 57 | M | 2 | High School | Single | $80K - $120K | Blue | 52 | 6 | 3 | 4 | 32409.0 | 0 | 32409.0 | 0.943 | 4218 | 66 | 0.571 | 0.000 |
| 9227 | Attrited Customer | 60 | M | 0 | College | Married | $80K - $120K | Blue | 50 | 1 | 4 | 2 | 29239.0 | 0 | 29239.0 | 1.057 | 8169 | 65 | 0.585 | 0.000 |
| 6371 | Existing Customer | 55 | M | 2 | Uneducated | Single | $60K - $80K | Silver | 48 | 5 | 3 | 3 | 34516.0 | 1384 | 33132.0 | 0.625 | 3807 | 82 | 0.864 | 0.040 |
| 9024 | Existing Customer | 52 | M | 2 | Doctorate | Married | $120K + | Blue | 44 | 1 | 2 | 2 | 34516.0 | 1382 | 33134.0 | 0.856 | 8604 | 108 | 0.714 | 0.040 |
| 6537 | Existing Customer | 37 | M | 3 | Unknown | Single | $80K - $120K | Blue | 22 | 4 | 2 | 2 | 29898.0 | 1414 | 28484.0 | 0.453 | 3883 | 68 | 0.700 | 0.047 |
| 9714 | Attrited Customer | 39 | M | 2 | Uneducated | Married | $120K + | Silver | 19 | 2 | 1 | 2 | 34516.0 | 796 | 33720.0 | 0.841 | 7721 | 75 | 0.667 | 0.023 |
| 8733 | Existing Customer | 55 | M | 3 | High School | Single | $80K - $120K | Gold | 49 | 2 | 1 | 3 | 34516.0 | 810 | 33706.0 | 0.787 | 7917 | 95 | 0.667 | 0.023 |
| 79 | Existing Customer | 47 | M | 2 | Graduate | Married | $80K - $120K | Blue | 38 | 6 | 3 | 2 | 28904.0 | 1899 | 27005.0 | 0.850 | 1334 | 35 | 0.400 | 0.066 |
| 280 | Existing Customer | 43 | M | 1 | Graduate | Single | $80K - $120K | Silver | 37 | 4 | 3 | 2 | 34516.0 | 1440 | 33076.0 | 1.117 | 1575 | 34 | 2.400 | 0.042 |
| 9532 | Existing Customer | 45 | M | 4 | Graduate | Married | $80K - $120K | Blue | 32 | 1 | 1 | 3 | 26819.0 | 1657 | 25162.0 | 0.816 | 14450 | 111 | 0.762 | 0.062 |
| 2786 | Existing Customer | 43 | M | 2 | Graduate | Married | $120K + | Silver | 31 | 6 | 2 | 4 | 34516.0 | 2398 | 32118.0 | 0.917 | 4976 | 88 | 0.692 | 0.069 |
| 9727 | Existing Customer | 37 | M | 1 | Uneducated | Married | $120K + | Blue | 17 | 1 | 1 | 3 | 34516.0 | 0 | 34516.0 | 0.776 | 14127 | 116 | 0.731 | 0.000 |
| 6907 | Attrited Customer | 62 | M | 0 | Post-Graduate | Divorced | $60K - $80K | Silver | 46 | 4 | 5 | 6 | 28229.0 | 287 | 27942.0 | 0.594 | 2281 | 45 | 0.552 | 0.010 |
| 8857 | Attrited Customer | 52 | M | 1 | High School | Married | $120K + | Blue | 34 | 2 | 3 | 1 | 34516.0 | 0 | 34516.0 | 1.038 | 4863 | 56 | 0.474 | 0.000 |
| 7799 | Existing Customer | 40 | M | 3 | College | Divorced | $80K - $120K | Silver | 34 | 2 | 1 | 3 | 34516.0 | 2052 | 32464.0 | 0.477 | 3510 | 62 | 0.476 | 0.059 |
The data looks valid. In the sample reviewed, most high-credit-limit customers have one or more dependents and report an income above $60K, many of them above $80K. All of them have held their accounts for more than a year, and many make a large number of transactions.
We can assume the data is valid.
# Creating method to show log transformation
def showLogTransformation(cols_to_log, bins=20):
    for colname in cols_to_log:
        # histogram of the raw values
        plt.hist(data[colname], bins)
        plt.title(colname)
        plt.show()
        # histogram of log(x + 1); the +1 keeps zero values defined
        plt.hist(np.log(data[colname] + 1), bins)
        plt.title(f"log transformation({colname})")
        plt.show()


def applyLogTransformation(cols_to_log):
    # add a <column>_log feature for each column, then drop the originals
    for colname in cols_to_log:
        data[colname + "_log"] = np.log(data[colname] + 1)
    data.drop(cols_to_log, axis=1, inplace=True)
# Checking how the log-transformed data looks
showLogTransformation(["Credit_Limit"], 50)
The log-transformed credit limit shows a better distribution.
Let's add the log-transformed credit limit and rerun the initial analysis on that feature.
# Applying the log transformation and deleting the existing feature
applyLogTransformation(["Credit_Limit"])
# Observations on Credit_Limit
histogram_boxplot(data, "Credit_Limit_log")
# checking outliers range
check_outlier_using_IQR("Credit_Limit_log", 3)
Column : Credit_Limit_log Outlier(s) check Mean : 8.60367493269847, Median : 8.422882511944996, Min : 7.271912163268625, Max : 10.44920723527944 Q1 = 7.846198815497425, Q3 = 9.31185851414622, 3*IQR = 4.396979095946383, Total Outlier(s) : 0 listing all outliers Series([], Name: Credit_Limit_log, dtype: float64) --------------------------------------------------
Observation: the mean and median almost match, and the box plot shows no outliers.
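The transformation used above, np.log(x + 1), is the same quantity NumPy exposes as np.log1p, and np.expm1 inverts it. A minimal sketch on made-up values follows; the toy frame and the helper name add_log_columns are illustrative assumptions and not part of the notebook, and unlike applyLogTransformation the helper returns a copy instead of modifying a global frame.

```python
import numpy as np
import pandas as pd


def add_log_columns(df, cols):
    """Return a copy of df where each column in cols is replaced by <col>_log."""
    out = df.copy()
    for col in cols:
        out[col + "_log"] = np.log1p(out[col])  # same as np.log(x + 1)
    return out.drop(columns=cols)


# Hypothetical right-skewed values, loosely in the range of a credit limit
toy = pd.DataFrame({"Credit_Limit": [1438.3, 2500.0, 4549.0, 11068.0, 34516.0]})
toy_log = add_log_columns(toy, ["Credit_Limit"])
print(toy_log.round(3))

# The inverse transform recovers the original values
recovered = np.expm1(toy_log["Credit_Limit_log"])
print(np.allclose(recovered, toy["Credit_Limit"]))
```

Keeping the inverse in mind is useful later, when thresholds learned on the log scale need to be reported in the original units.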
# Observations on Total_Revolving_Bal
histogram_boxplot(data, "Total_Revolving_Bal")
# checking outliers range
check_outlier_using_IQR("Total_Revolving_Bal", 3)
Column : Total_Revolving_Bal Outlier(s) check Mean : 1162.8140614199665, Median : 1276.0, Min : 0, Max : 2517 Q1 = 359.0, Q3 = 1784.0, 3*IQR = 4275.0, Total Outlier(s) : 0 listing all outliers Series([], Name: Total_Revolving_Bal, dtype: int64) --------------------------------------------------
# Checking how the log-transformed data looks
showLogTransformation(["Total_Revolving_Bal"], 50)
# Applying the log transformation and deleting the existing feature
applyLogTransformation(["Total_Revolving_Bal"])
# Observations on Total_Revolving_Bal_log
histogram_boxplot(data, "Total_Revolving_Bal_log")
# checking outliers range
check_outlier_using_IQR("Total_Revolving_Bal_log", 3)
Column : Total_Revolving_Bal_log Outlier(s) check Mean : 5.491204393083048, Median : 7.152268856032539, Min : 0.0, Max : 7.831220214604293 Q1 = 5.886088599113236, Q3 = 7.487173694213739, 3*IQR = 4.803255285301509, Total Outlier(s) : 2470 listing 10 sample outliers 4044 0.0 3472 0.0 4108 0.0 8062 0.0 3795 0.0 2 0.0 4941 0.0 5633 0.0 4282 0.0 9820 0.0 Name: Total_Revolving_Bal_log, dtype: float64 --------------------------------------------------
# Observations on Avg_Open_To_Buy
histogram_boxplot(data, "Avg_Open_To_Buy")
# checking outliers range
check_outlier_using_IQR("Avg_Open_To_Buy", 3)
Column : Avg_Open_To_Buy Outlier(s) check Mean : 7469.139636614887, Median : 3474.0, Min : 3.0, Max : 34516.0 Q1 = 1324.5, Q3 = 9859.0, 3*IQR = 25603.5, Total Outlier(s) : 659 listing 10 sample outliers 9665 34227.0 2984 33859.0 9295 33385.0 10024 32439.0 2758 32887.0 9259 34516.0 3682 34516.0 9462 32793.0 9664 32553.0 1022 31999.0 Name: Avg_Open_To_Buy, dtype: float64 --------------------------------------------------
# Checking how the log-transformed data looks
showLogTransformation(["Avg_Open_To_Buy"], 50)
# Applying the log transformation and deleting the existing feature
applyLogTransformation(["Avg_Open_To_Buy"])
# Observations on Avg_Open_To_Buy_log
histogram_boxplot(data, "Avg_Open_To_Buy_log")
# checking outliers range
check_outlier_using_IQR("Avg_Open_To_Buy_log", 3)
Column : Avg_Open_To_Buy_log Outlier(s) check Mean : 8.164538314637664, Median : 8.153349757998892, Min : 1.3862943611198906, Max : 10.44920723527944 Q1 = 7.189544954583065, Q3 = 9.196241427024697, 3*IQR = 6.020089417324897, Total Outlier(s) : 1 listing all outliers 4443 1.386294 Name: Avg_Open_To_Buy_log, dtype: float64 --------------------------------------------------
# Observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")
# checking outliers range
check_outlier_using_IQR("Total_Amt_Chng_Q4_Q1", 3)
Column : Total_Amt_Chng_Q4_Q1 Outlier(s) check Mean : 0.7599406536980376, Median : 0.736, Min : 0.0, Max : 3.397 Q1 = 0.631, Q3 = 0.859, 3*IQR = 0.6839999999999999, Total Outlier(s) : 158 listing 10 sample outliers 1786 1.504 1698 1.483 1166 1.596 1105 1.494 984 1.451 1883 1.669 1008 1.593 3270 1.675 2498 1.454 929 1.458 Name: Total_Amt_Chng_Q4_Q1, dtype: float64 --------------------------------------------------
# Observations on Total_Trans_Amt
histogram_boxplot(data, "Total_Trans_Amt")
# checking outliers range
check_outlier_using_IQR("Total_Trans_Amt", 3)
Column : Total_Trans_Amt Outlier(s) check Mean : 4404.086303939963, Median : 3899.0, Min : 510, Max : 18484 Q1 = 2155.5, Q3 = 4741.0, 3*IQR = 7756.5, Total Outlier(s) : 746 listing 10 sample outliers 9891 15442 9295 14716 9174 12896 9594 16493 9438 13513 10086 16177 9799 15886 9786 15471 9583 15514 9262 12867 Name: Total_Trans_Amt, dtype: int64 --------------------------------------------------
# Checking how the log-transformed data looks
showLogTransformation(["Total_Trans_Amt"], 50)
# Applying the log transformation and deleting the existing feature
applyLogTransformation(["Total_Trans_Amt"])
# Observations on Total_Trans_Amt_log
histogram_boxplot(data, "Total_Trans_Amt_log")
# checking outliers range
check_outlier_using_IQR("Total_Trans_Amt_log", 3)
Column : Total_Trans_Amt_log Outlier(s) check Mean : 8.165163993911522, Median : 8.268731832117737, Min : 6.236369590203704, Max : 9.824714871370732 Q1 = 7.676241789209022, Q3 = 8.464214266625351, 3*IQR = 2.3639174322489884, Total Outlier(s) : 0 listing all outliers Series([], Name: Total_Trans_Amt_log, dtype: float64) --------------------------------------------------
# Observations on Total_Trans_Ct
histogram_boxplot(data, "Total_Trans_Ct")
# checking outliers range
check_outlier_using_IQR("Total_Trans_Ct", 3)
Column : Total_Trans_Ct Outlier(s) check Mean : 64.85869457884863, Median : 67.0, Min : 10, Max : 139 Q1 = 45.0, Q3 = 81.0, 3*IQR = 108.0, Total Outlier(s) : 0 listing all outliers Series([], Name: Total_Trans_Ct, dtype: int64) --------------------------------------------------
# Observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
# checking outliers range
check_outlier_using_IQR("Total_Ct_Chng_Q4_Q1", 3)
Column : Total_Ct_Chng_Q4_Q1 Outlier(s) check Mean : 0.7122223758269962, Median : 0.702, Min : 0.0, Max : 3.714 Q1 = 0.582, Q3 = 0.818, 3*IQR = 0.708, Total Outlier(s) : 123 listing 10 sample outliers 1256 2.000 324 1.875 1041 1.750 392 1.636 457 1.500 432 1.429 69 2.000 1972 1.444 2358 1.882 2510 2.500 Name: Total_Ct_Chng_Q4_Q1, dtype: float64 --------------------------------------------------
# Observations on Avg_Utilization_Ratio
histogram_boxplot(data, "Avg_Utilization_Ratio")
# checking outliers range
check_outlier_using_IQR("Avg_Utilization_Ratio", 3)
Column : Avg_Utilization_Ratio Outlier(s) check Mean : 0.2748935518909845, Median : 0.176, Min : 0.0, Max : 0.999 Q1 = 0.023, Q3 = 0.503, 3*IQR = 1.44, Total Outlier(s) : 0 listing all outliers Series([], Name: Avg_Utilization_Ratio, dtype: float64) --------------------------------------------------
# Observations on Attrition_Flag
labeled_barplot(data, "Attrition_Flag", True)
replaceStruct = {"Attrition_Flag": {"Existing Customer": 0, "Attrited Customer": 1}}
data = data.replace(replaceStruct)
data["Attrition_Flag"] = data["Attrition_Flag"].astype("int64")
labeled_barplot(data=data, feature="Attrition_Flag", perc=True)
# Observations on Gender
labeled_barplot(data, "Gender", True)
# Observations on Education_Level
labeled_barplot(data, "Education_Level", True)
# Observations on Marital_Status
labeled_barplot(data, "Marital_Status", True)
# Observations on Income_Category
labeled_barplot(data, "Income_Category", True)
# Replacing abc with Unknown to keep it similar with other fields
replaceStruct = {"Income_Category": {"abc": "Unknown"}}
data = data.replace(replaceStruct)
labeled_barplot(data=data, feature="Income_Category", perc=True)
# Observations on Card_Category
labeled_barplot(data, "Card_Category", True)
Saving memory space
category_cols = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
data[category_cols] = data[category_cols].astype("category")
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null category 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null int64 4 Education_Level 10127 non-null category 5 Marital_Status 10127 non-null category 6 Income_Category 10127 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Total_Amt_Chng_Q4_Q1 10127 non-null float64 13 Total_Trans_Ct 10127 non-null int64 14 Total_Ct_Chng_Q4_Q1 10127 non-null float64 15 Avg_Utilization_Ratio 10127 non-null float64 16 Credit_Limit_log 10127 non-null float64 17 Total_Revolving_Bal_log 10127 non-null float64 18 Avg_Open_To_Buy_log 10127 non-null float64 19 Total_Trans_Amt_log 10127 non-null float64 dtypes: category(6), float64(7), int64(7) memory usage: 1.1 MB
Observations on data types
Converting all object columns to category saved about 0.4 MB of memory.
The dataset has 6 categorical features and 14 numerical features.
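The saving comes from how a category column is stored: one small integer code per row plus a single copy of each distinct label. A rough sketch on synthetic data follows; the row count and the toy column are assumptions, and memory_usage(deep=True) is used here so the size of the Python strings is counted, which may differ from what data.info() reports.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows = 10_000  # hypothetical size, close to the dataset's 10127 rows
toy = pd.DataFrame(
    {"Card_Category": rng.choice(["Blue", "Silver", "Gold", "Platinum"], size=n_rows)}
)

as_object = toy["Card_Category"].astype(object)
as_category = toy["Card_Category"].astype("category")

# deep=True measures the strings themselves, not only the pointers to them
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f"object:   {bytes_object:,} bytes")
print(f"category: {bytes_category:,} bytes")
```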
plt.figure(figsize=(20, 15))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
observations on heatmap
plt.figure(figsize=(30, 15))
sns.pairplot(data, hue="Attrition_Flag")
plt.show()
<Figure size 2160x1080 with 0 Axes>
observations on pairplot
generate_bivariate_chart(
xfeature="Total_Trans_Amt_log",
yfeature="Total_Trans_Ct",
data=data,
hue="Attrition_Flag",
)
Genrating Charts for feature x : Total_Trans_Amt_log, y : Total_Trans_Ct
generate_bivariate_chart(
xfeature="Credit_Limit_log",
yfeature="Avg_Open_To_Buy_log",
data=data,
hue="Attrition_Flag",
)
Genrating Charts for feature x : Credit_Limit_log, y : Avg_Open_To_Buy_log
generate_bivariate_chart(
xfeature="Customer_Age", yfeature="Months_on_book", data=data, hue="Attrition_Flag",
)
Genrating Charts for feature x : Customer_Age, y : Months_on_book
## dropping highly correlated features
data.drop("Total_Trans_Ct", axis=1, inplace=True)
data.drop("Avg_Open_To_Buy_log", axis=1, inplace=True)
data.drop("Contacts_Count_12_mon", axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null category 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null int64 4 Education_Level 10127 non-null category 5 Marital_Status 10127 non-null category 6 Income_Category 10127 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Total_Amt_Chng_Q4_Q1 10127 non-null float64 12 Total_Ct_Chng_Q4_Q1 10127 non-null float64 13 Avg_Utilization_Ratio 10127 non-null float64 14 Credit_Limit_log 10127 non-null float64 15 Total_Revolving_Bal_log 10127 non-null float64 16 Total_Trans_Amt_log 10127 non-null float64 dtypes: category(6), float64(6), int64(5) memory usage: 931.0 KB
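The features dropped above were chosen by reading the heatmap and the bivariate charts. One way to list candidate pairs programmatically is sketched below; the helper name high_corr_pairs, the 0.8 threshold, and the toy data are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """List (feature_a, feature_b, corr) for numeric pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never satisfy the comparison, so they drop out here
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]


# Hypothetical data: "b" is almost a copy of "a", "c" is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.1, size=500), "c": rng.normal(size=500)}
)
result = high_corr_pairs(toy)
print(result)
```

Which member of a pair to drop remains a judgment call; the list only narrows down where to look.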
stacked_barplot(data, "Gender", "Attrition_Flag")
| Gender | Attrition_Flag 0 | Attrition_Flag 1 | All |
|---|---|---|---|
| All | 8500 | 1627 | 10127 |
| F | 4428 | 930 | 5358 |
| M | 4072 | 697 | 4769 |
We do not see any significant effect of gender on attrition.
stacked_barplot(data, "Education_Level", "Attrition_Flag")
| Education_Level | Attrition_Flag 0 | Attrition_Flag 1 | All |
|---|---|---|---|
| All | 8500 | 1627 | 10127 |
| Graduate | 2641 | 487 | 3128 |
| High School | 1707 | 306 | 2013 |
| Unknown | 1263 | 256 | 1519 |
| Uneducated | 1250 | 237 | 1487 |
| College | 859 | 154 | 1013 |
| Doctorate | 356 | 95 | 451 |
| Post-Graduate | 424 | 92 | 516 |
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
| Marital_Status | Attrition_Flag 0 | Attrition_Flag 1 | All |
|---|---|---|---|
| All | 8500 | 1627 | 10127 |
| Married | 3978 | 709 | 4687 |
| Single | 3275 | 668 | 3943 |
| Unknown | 620 | 129 | 749 |
| Divorced | 627 | 121 | 748 |
stacked_barplot(data, "Income_Category", "Attrition_Flag")
| Income_Category | Attrition_Flag 0 | Attrition_Flag 1 | All |
|---|---|---|---|
| All | 8500 | 1627 | 10127 |
| Less than $40K | 2949 | 612 | 3561 |
| $40K - $60K | 1519 | 271 | 1790 |
| $80K - $120K | 1293 | 242 | 1535 |
| $60K - $80K | 1213 | 189 | 1402 |
| Unknown | 925 | 187 | 1112 |
| $120K + | 601 | 126 | 727 |
stacked_barplot(data, "Card_Category", "Attrition_Flag")
| Card_Category | Attrition_Flag 0 | Attrition_Flag 1 | All |
|---|---|---|---|
| All | 8500 | 1627 | 10127 |
| Blue | 7917 | 1519 | 9436 |
| Silver | 473 | 82 | 555 |
| Gold | 95 | 21 | 116 |
| Platinum | 15 | 5 | 20 |
The Silver and Blue card categories have very similar attrition percentages.
Platinum and Gold have a slightly higher percentage than Silver and Blue, but their overall volume is small compared with the Blue category.
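The percentages behind these remarks can be read off the Card_Category counts in the output above. A short sketch that recomputes them follows; the counts are copied from that output, while the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the Card_Category output above (0 = existing, 1 = attrited)
counts = pd.DataFrame(
    {0: [7917, 473, 95, 15], 1: [1519, 82, 21, 5]},
    index=["Blue", "Silver", "Gold", "Platinum"],
)

# Share of attrited customers within each card category, in percent
attrition_rate = (counts[1] / counts.sum(axis=1) * 100).round(1)
print(attrition_rate)
```

On the full frame, pd.crosstab(data["Card_Category"], data["Attrition_Flag"], normalize="index") should give the same row-wise shares directly. With only 20 Platinum customers, the Platinum rate is too noisy to read much into.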
cols = [
    "Customer_Age",
    "Dependent_count",
    "Months_on_book",
    "Total_Relationship_Count",
    "Months_Inactive_12_mon",
]
plt.figure(figsize=(15, 20))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    # keyword arguments keep this working on newer seaborn releases
    sns.boxplot(x=data["Attrition_Flag"], y=data[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()
cols = [
    "Total_Amt_Chng_Q4_Q1",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
    "Credit_Limit_log",
    "Total_Revolving_Bal_log",
    "Total_Trans_Amt_log",
]
plt.figure(figsize=(15, 20))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    # keyword arguments keep this working on newer seaborn releases
    sns.boxplot(x=data["Attrition_Flag"], y=data[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()
Dependent count, total relationship count, months inactive and bank contact count all have some effect on attrition.
Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 have some effect on attrition.
Average utilization ratio, total revolving balance, total transaction amount and credit limit have a significant impact on attrition.
# One boxplot per (numeric feature, categorical feature) pair, split by Attrition_Flag
for yfeature in [
    "Total_Trans_Amt_log",
    "Credit_Limit_log",
    "Avg_Utilization_Ratio",
    "Total_Revolving_Bal_log",
]:
    for xfeature in ["Education_Level", "Income_Category"]:
        plt.figure(figsize=(15, 5))
        sns.boxplot(x=data[xfeature], y=data[yfeature], hue=data["Attrition_Flag"])
        plt.show()
Invalid category values were replaced with Unknown. No rows were dropped. Log transformations were applied to handle skewness. Correlated features were dropped.
# Separating target variable and other variables
X = data.drop(columns="Attrition_Flag")
Y = data["Attrition_Flag"]
print(f"Shape of X: {X.shape}, And Y: {Y.shape}")
print("Y feature, counts of label 'Yes': {}".format(sum(Y == 1)))
print("Y feature, counts of label 'No': {} \n".format(sum(Y == 0)))
Shape of X: (10127, 16), And Y: (10127,) Y feature, counts of label 'Yes': 1627 Y feature, counts of label 'No': 8500
Y.value_counts(normalize=True)
0 0.83934 1 0.16066 Name: Attrition_Flag, dtype: float64
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=Y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 16) (2026, 16) (2026, 16)
y_train.value_counts(normalize=True)
0 0.839342 1 0.160658 Name: Attrition_Flag, dtype: float64
y_val.value_counts(normalize=True)
0 0.839092 1 0.160908 Name: Attrition_Flag, dtype: float64
y_test.value_counts(normalize=True)
0 0.839585 1 0.160415 Name: Attrition_Flag, dtype: float64
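The two-step split gives a 60/20/20 partition because the second split takes 25% of the remaining 80%, which is 20% of the total. A small sketch that reproduces the row counts on a stand-in array follows; X_toy and y_toy are hypothetical and only match the dataset's row count and class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same row count and class counts as the dataset
n_rows, n_attrited = 10127, 1627
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.array([1] * n_attrited + [0] * (n_rows - n_attrited))

# Step 1: hold out 20% as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
# Step 2: 25% of the remaining 80% becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)

for label, y_part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{label:>10}: {len(y_part):5d} rows, attrition rate {y_part.mean():.4f}")
```

Because both calls pass stratify, each part keeps an attrition rate close to the overall 16%.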
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
print(f"After one hot encoding - Shape of X: {X_train.shape}")
(6075, 29) (2026, 29) (2026, 29) After one hot encoding - Shape of X: (6075, 29)
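All three splits end up with 29 columns here, most likely because the categorical columns were converted to category dtype before splitting, so each split still knows the full list of levels. With plain string columns, a level missing from one split would produce fewer dummy columns. A hedged sketch of that failure mode and one way to guard against it follows; the toy frames are made up for illustration.

```python
import pandas as pd

# Hypothetical splits: the validation split happens to contain no "Platinum" rows
train_part = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Platinum", "Silver"]})
val_part = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})

train_enc = pd.get_dummies(train_part, drop_first=True)
val_enc = pd.get_dummies(val_part, drop_first=True)
print(train_enc.shape, val_enc.shape)  # the column counts differ

# Reuse the training columns; levels unseen in a split become all-zero columns
val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_aligned.columns) == list(train_enc.columns))
```

Reindexing against the training columns also fixes the column order, which matters for models that consume plain arrays.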
Let's start by building different models using StratifiedKFold and cross_val_score, and then tune the best model using GridSearchCV and RandomizedSearchCV.
Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. It splits the dataset into k consecutive folds (without shuffling by default), keeping the distribution of both classes in each fold the same as in the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
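The stratified K-fold behaviour described earlier can be checked on a small imbalanced target. In this sketch the 16% positive rate is chosen to resemble the attrition rate, and the arrays are synthetic assumptions; shuffle=True and random_state=1 mirror the settings used in the model-building cell.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 16% positives, similar to the attrition rate
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_sizes = []
fold_rates = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    fold_sizes.append(len(val_idx))
    fold_rates.append(y_toy[val_idx].mean())  # share of positives in this fold

print(fold_sizes)
print([round(rate, 3) for rate in fold_rates])
```

Every validation fold keeps roughly the same share of positives, which is what makes a recall estimate comparable from fold to fold.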
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(
    name, model, train_x_data, train_y_data, val_x_data, val_y_data
):
    """
    Function to compute different metrics to check classification model performance

    name: label shown in the "Model" column
    model: fitted classifier
    train_x_data, train_y_data: training predictors and target
    val_x_data, val_y_data: validation/test predictors and target
    """
    # print(f"Model '{name}' performance \n\n")
    predictors_train = train_x_data
    target_train = train_y_data
    predictors_test = val_x_data
    target_test = val_y_data

    # predicting training data using the independent variables
    pred_train = model.predict(predictors_train)
    acc_train = accuracy_score(target_train, pred_train)  # to compute Accuracy
    recall_train = recall_score(target_train, pred_train)  # to compute Recall
    precision_train = precision_score(target_train, pred_train)  # to compute Precision
    f1_train = f1_score(target_train, pred_train)  # to compute F1-score
    # ROC AUC is computed from hard class predictions here, not from probabilities
    roc_train = roc_auc_score(target_train, pred_train)

    # predicting test data using the independent variables
    pred_test = model.predict(predictors_test)
    acc_test = accuracy_score(target_test, pred_test)  # to compute Accuracy
    recall_test = recall_score(target_test, pred_test)  # to compute Recall
    precision_test = precision_score(target_test, pred_test)  # to compute Precision
    f1_test = f1_score(target_test, pred_test)  # to compute F1-score
    roc_test = roc_auc_score(target_test, pred_test)  # to compute ROC AUC

    # creating a dataframe of metrics (np.round replaces np.round_, which NumPy 2.0 removed)
    df_perf = pd.DataFrame(
        [
            {
                "Model": name,
                "Data": "Training",
                "Data Shape": train_x_data.shape,
                "Recall": np.round(recall_train * 100, decimals=3),
                "F1-Score": np.round(f1_train * 100, decimals=3),
                "Accuracy": np.round(acc_train * 100, decimals=3),
                "Precision": np.round(precision_train * 100, decimals=3),
                "ROC-AUC": np.round(roc_train * 100, decimals=3),
            },
            {
                "Model": name,
                "Data": "Validation/Test",
                "Data Shape": val_x_data.shape,
                "Recall": np.round(recall_test * 100, decimals=3),
                "F1-Score": np.round(f1_test * 100, decimals=3),
                "Accuracy": np.round(acc_test * 100, decimals=3),
                "Precision": np.round(precision_test * 100, decimals=3),
                "ROC-AUC": np.round(roc_test * 100, decimals=3),
            },
        ]
    )
    return df_perf
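One detail worth keeping in mind when reading the ROC-AUC column: the function passes the output of model.predict to roc_auc_score. With hard 0/1 labels the score reduces to the average of recall and specificity, whereas scores such as model.predict_proba(X)[:, 1] let the metric evaluate the full ranking. A tiny sketch with made-up values shows that the two can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.2, 0.3, 0.4, 0.9])  # hypothetical predicted probabilities
y_label = (y_score >= 0.5).astype(int)  # hard labels at a 0.5 cutoff

# Ranking-based AUC: every positive is scored above every negative
auc_from_scores = roc_auc_score(y_true, y_score)
# Label-based AUC: equals (recall + specificity) / 2 for binary predictions
auc_from_labels = roc_auc_score(y_true, y_label)
print(auc_from_scores, auc_from_labels)
```

The values reported in this notebook are of the second kind, so they are best compared only with each other.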
def confusion_matrix_classification(
    name, model, train_x_data, train_y_data, val_x_data, val_y_data
):
    """
    To plot the confusion_matrix with percentages

    name: label printed above the plots
    model: fitted classifier
    train_x_data, train_y_data: training predictors and target
    val_x_data, val_y_data: validation/test predictors and target
    """
    sns.set_context(
        "paper", rc={"font.size": 14, "axes.titlesize": 14, "axes.labelsize": 14}
    )
    print(f"Model '{name}' confusion matrix \n\n")
    predictors_train = train_x_data
    target_train = train_y_data
    predictors_test = val_x_data
    target_test = val_y_data
    # predictors_output = test_y_data.copy()
    y_pred_train = model.predict(predictors_train)
    y_pred_test = model.predict(predictors_test)
    cm_train = confusion_matrix(target_train, y_pred_train)
    cm_test = confusion_matrix(target_test, y_pred_test)
    # predictors_output['Orig'] = target_test
    # predictors_output['Model'] = y_pred_test
    plt.rcParams["figure.figsize"] = [12, 6]
    plt.rcParams["figure.autolayout"] = True
    f, axes = plt.subplots(1, 2)
    # each cell shows the count and its share of all samples
    labels = np.asarray(
        [
            [
                "{0:0.0f}".format(item)
                + "\n{0:.2%}".format(item / cm_train.flatten().sum())
            ]
            for item in cm_train.flatten()
        ]
    ).reshape(2, 2)
    labels_test = np.asarray(
        [
            [
                "{0:0.0f}".format(item)
                + "\n{0:.2%}".format(item / cm_test.flatten().sum())
            ]
            for item in cm_test.flatten()
        ]
    ).reshape(2, 2)
    # sklearn puts true labels on rows (y-axis) and predictions on columns (x-axis)
    g = sns.heatmap(cm_train, annot=labels, fmt="", ax=axes[0])
    g.set(xlabel="Predicted label", ylabel="True label", title="Training Data")
    g1 = sns.heatmap(cm_test, annot=labels_test, fmt="", ax=axes[1])
    g1.set(xlabel="Predicted label", ylabel="True label", title="Validation/Test Data")
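When reading these heatmaps it helps to recall the layout scikit-learn uses for confusion_matrix: rows are the true classes and columns are the predicted classes, so a heatmap of the matrix has true labels on the y-axis. A minimal check on made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]  # hypothetical predictions: one false positive, no false negatives

cm = confusion_matrix(y_true, y_pred)
# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

For churn, the bottom-left cell (false negatives) counts the attrited customers the model missed, which is the cell recall is meant to shrink.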
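To avoid overfitting, the decision tree, bagging and random forest models below are constrained with max_leaf_nodes (and max_depth for the decision trees).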
# method that builds 6 models and prints their scores, metrics and performance in a box plot
def build_models_and_score(data_type, t_x_data, t_y_data, v_x_data, v_y_data):
    models = []  # Empty list to store all the models
    models_scores = []  # Empty list to store all the model scores

    # Appending models into the list
    models.append(
        (
            "LR " + data_type,
            LogisticRegression(
                solver="newton-cg", penalty="none", verbose=False, n_jobs=-1
            ),
        )
    )
    models.append(
        (
            "DT " + data_type,
            DecisionTreeClassifier(
                random_state=1, criterion="gini", max_depth=50, max_leaf_nodes=10,
            ),
        )
    )
    models.append(
        (
            "Bag-DT " + data_type,
            BaggingClassifier(
                base_estimator=DecisionTreeClassifier(
                    criterion="gini", random_state=1, max_depth=50, max_leaf_nodes=10
                ),
                random_state=1,
                n_jobs=-1,
                n_estimators=100,
            ),
        )
    )
    models.append(
        (
            "RF " + data_type,
            RandomForestClassifier(
                random_state=1, n_estimators=100, n_jobs=-1, max_leaf_nodes=10
            ),
        )
    )
    models.append(
        ("GBC " + data_type, GradientBoostingClassifier(random_state=1, n_estimators=100))
    )
    models.append(
        ("ABC " + data_type, AdaBoostClassifier(random_state=1, n_estimators=100))
    )

    results = []  # Empty list to store all model's CV scores
    names = []  # Empty list to store name of the models

    # loop through all models to get the mean cross validated score
    print("\n" "Cross-Validation Performance:" "\n")
    for name, model in models:
        # scoring = "accuracy"
        # scoring = "balanced_accuracy"
        # scoring = "f1_weighted"
        # scoring = "recall_weighted"
        scoring = "recall"
        # scoring = "roc_auc"
        kfold = StratifiedKFold(
            n_splits=10, shuffle=True, random_state=1
        )  # Setting number of splits equal to 10
        cv_result = cross_val_score(
            estimator=model, X=t_x_data, y=t_y_data, scoring=scoring, cv=kfold
        )
        results.append(cv_result)
        names.append(name)
        print("{}: {}".format(name, cv_result.mean() * 100))

    print("\n" "All Model Performance:" "\n")
    for name, model in models:
        model.fit(t_x_data, t_y_data)
        # models_scores.append(model_performance_classification(name, model, X_train, y_train, X_val, y_val))
        # print(f"\nModel Performance: {name}\n")
        display(
            model_performance_classification(
                name, model, t_x_data, t_y_data, v_x_data, v_y_data
            )
        )

    # Plotting boxplots for CV scores of all models defined above
    # plt.figure(figsize=(20, 13))
    plt.rcParams["figure.figsize"] = [20, 13]
    sns.set(font_scale=1.5)
    fig = plt.figure()
    plt.rcParams["figure.autolayout"] = True
    fig.suptitle("Algorithm Comparison for " + data_type)
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    plt.xticks(rotation=45)
    ax.set_xticklabels(names)
    plt.show()
build_models_and_score("Regular Data", X_train, y_train, X_val, y_val)
Cross-Validation Performance:
LR Regular Data: 41.899852724594986
DT Regular Data: 58.913317904481374
Bag-DT Regular Data: 60.75846833578792
RF Regular Data: 28.071744161582156
GBC Regular Data: 75.61119293078056
ABC Regular Data: 74.47822427940248
All Model Performance:
| Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|
| LR Regular Data | Training | (6075, 29) | 42.520 | 54.605 | 88.642 | 76.287 | 69.995 |
| LR Regular Data | Validation/Test | (2026, 29) | 47.853 | 59.429 | 89.487 | 78.392 | 72.662 |
| DT Regular Data | Training | (6075, 29) | 62.602 | 68.806 | 90.881 | 76.375 | 79.448 |
| DT Regular Data | Validation/Test | (2026, 29) | 69.325 | 71.293 | 91.017 | 73.377 | 82.251 |
| Bag-DT Regular Data | Training | (6075, 29) | 61.373 | 68.418 | 90.897 | 77.290 | 78.961 |
| Bag-DT Regular Data | Validation/Test | (2026, 29) | 66.871 | 70.209 | 90.869 | 73.898 | 81.171 |
| RF Regular Data | Training | (6075, 29) | 30.840 | 46.024 | 88.379 | 90.663 | 65.116 |
| RF Regular Data | Validation/Test | (2026, 29) | 33.436 | 49.099 | 88.845 | 92.373 | 66.453 |
| GBC Regular Data | Training | (6075, 29) | 82.480 | 87.595 | 96.247 | 93.387 | 90.681 |
| GBC Regular Data | Validation/Test | (2026, 29) | 84.969 | 88.076 | 96.298 | 91.419 | 91.720 |
| ABC Regular Data | Training | (6075, 29) | 77.254 | 80.945 | 94.156 | 85.006 | 87.323 |
| ABC Regular Data | Validation/Test | (2026, 29) | 80.368 | 81.366 | 94.077 | 82.390 | 88.537 |
Logistic Regression - Model did not overfit, but recall and F1 score are very low.
Decision Tree Classifier - Model did not overfit. Decent recall and F1 score; scores can be improved by tuning parameters.
Bagging Classifier - Model did not overfit. Decent recall and F1 score; scores can be improved by tuning parameters.
Random Forest Classifier - Model did not overfit. Very good accuracy and precision, but very low recall.
Gradient Boosting Classifier - Model did not overfit. Great recall, and accuracy and F1 score are also good.
AdaBoost Classifier - Model did not overfit. Good recall, and accuracy and F1 score are also good.
Decision Tree Classifier, Bagging Classifier and Gradient Boosting Classifier
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099
After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099
After Oversampling, the shape of train_X: (10198, 29)
After Oversampling, the shape of train_y: (10198,)
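To give a feel for where the extra 4123 minority rows come from: SMOTE creates each synthetic row by picking a minority sample, picking one of its k nearest minority neighbors, and placing a new point somewhere on the segment between them. The sketch below is a simplified illustration of that interpolation idea only. It is not the imblearn implementation, and the function name smote_like_samples and the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_min, n_new, k_neighbors=5, seed=1):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_min)
    neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)  # starting points
    picked = neighbor_idx[base, rng.integers(0, k_neighbors, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority class: 20 points in 2 dimensions
X_minority = np.random.default_rng(0).normal(size=(20, 2))
X_synth = smote_like_samples(X_minority, n_new=30)
print(X_synth.shape)
```

Since synthetic rows are built from real neighbors, scoring with cross-validation on already oversampled data can look optimistic; the validation set, which is left untouched here, is the safer reference.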
build_models_and_score("Over Sampled Data", X_train_over, y_train_over, X_val, y_val)
Cross-Validation Performance:
LR Over Sampled Data: 86.40891405678187
DT Over Sampled Data: 78.34851111367927
Bag-DT Over Sampled Data: 79.87857775723256
RF Over Sampled Data: 84.82037058438307
GBC Over Sampled Data: 93.70488077352748
ABC Over Sampled Data: 92.68500327439423
All Model Performance:
| Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|
| LR Over Sampled Data | Training | (10198, 29) | 86.586 | 88.397 | 88.635 | 90.286 | 88.635 |
| LR Over Sampled Data | Validation/Test | (2026, 29) | 57.362 | 55.738 | 85.341 | 54.203 | 74.034 |
| DT Over Sampled Data | Training | (10198, 29) | 78.721 | 84.381 | 85.429 | 90.917 | 85.429 |
| DT Over Sampled Data | Validation/Test | (2026, 29) | 79.755 | 70.365 | 89.191 | 62.954 | 85.377 |
| Bag-DT Over Sampled Data | Training | (10198, 29) | 80.173 | 84.901 | 85.742 | 90.223 | 85.742 |
| Bag-DT Over Sampled Data | Validation/Test | (2026, 29) | 81.595 | 70.651 | 89.092 | 62.295 | 86.062 |
| RF Over Sampled Data | Training | (10198, 29) | 85.291 | 87.024 | 87.282 | 88.828 | 87.282 |
| RF Over Sampled Data | Validation/Test | (2026, 29) | 83.129 | 66.996 | 86.821 | 56.108 | 85.329 |
| GBC Over Sampled Data | Training | (10198, 29) | 94.646 | 95.413 | 95.450 | 96.193 | 95.450 |
| GBC Over Sampled Data | Validation/Test | (2026, 29) | 85.583 | 80.519 | 93.337 | 76.022 | 90.203 |
| ABC Over Sampled Data | Training | (10198, 29) | 93.293 | 93.476 | 93.489 | 93.660 | 93.489 |
| ABC Over Sampled Data | Validation/Test | (2026, 29) | 80.982 | 74.894 | 91.264 | 69.657 | 87.108 |
Logistic Regression - Overfits: recall falls from 86.6 on training to 57.4 on validation.
Decision Tree Classifier - Recall holds up on validation (79.8) and accuracy is good (89.2), but F1 drops from 84.4 to 70.4.
Bagging Classifier - Similar to the decision tree: validation recall 81.6 and accuracy 89.1, with F1 dropping from 84.9 to 70.7.
Random Forest Classifier - Good validation recall (83.1) and accuracy (86.8), but F1 drops sharply on validation (87.0 to 67.0).
Gradient Boosting Classifier - Slight overfit, but the best validation recall (85.6), accuracy (93.3) and F1 (80.5) of the six models.
AdaBoost Classifier - Slight overfit: validation recall 81.0, accuracy 91.3 and F1 74.9 are all below the training scores.
rus = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_under == 1)))
print(
"After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_under == 0))
)
print("After Under Sampling, the shape of train_X: {}".format(X_train_under.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,)
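To make the counts above concrete, here is a minimal pure-Python sketch of the idea behind random under-sampling: keep every minority-class row and draw an equal-sized random sample from each larger class. It is an illustration only and not how `RandomUnderSampler` is implemented; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_under_sample(rows, labels, seed=1):
    """Keep an equal-sized random sample of every class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # restore the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with the same class counts as the training set above.
labels = [1] * 976 + [0] * 5099
rows = list(range(len(labels)))

rows_under, labels_under = random_under_sample(rows, labels)
print(Counter(labels), "->", Counter(labels_under))
```

The trade-off is visible in the counts: balance is achieved by discarding most of the majority class, so the models below are trained on 1952 rows instead of 6075.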
build_models_and_score("Under Sampled Data", X_train_under, y_train_under, X_val, y_val)
Cross-Validation Performance:
LR Under Sampled Data: 74.48558804965285
DT Under Sampled Data: 84.83904902167053
Bag-DT Under Sampled Data: 87.70039974752788
RF Under Sampled Data: 84.32463707132337
GBC Under Sampled Data: 92.82242794024826
ABC Under Sampled Data: 88.93014937933937
All Model Performance:
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | LR Under Sampled Data | Training | (1952, 29) | 76.537 | 77.732 | 78.074 | 78.964 | 78.074 |
| 1 | LR Under Sampled Data | Validation/Test | (2026, 29) | 76.994 | 51.434 | 76.604 | 38.615 | 76.762 |
| 0 | DT Under Sampled Data | Training | (1952, 29) | 85.963 | 87.533 | 87.756 | 89.160 | 87.756 |
| 1 | DT Under Sampled Data | Validation/Test | (2026, 29) | 85.890 | 68.627 | 87.364 | 57.143 | 86.768 |
| 0 | Bag-DT Under Sampled Data | Training | (1952, 29) | 90.676 | 90.769 | 90.779 | 90.862 | 90.779 |
| 1 | Bag-DT Under Sampled Data | Validation/Test | (2026, 29) | 90.184 | 70.929 | 88.105 | 58.449 | 88.945 |
| 0 | RF Under Sampled Data | Training | (1952, 29) | 86.988 | 87.166 | 87.193 | 87.346 | 87.193 |
| 1 | RF Under Sampled Data | Validation/Test | (2026, 29) | 87.117 | 64.545 | 84.600 | 51.264 | 85.617 |
| 0 | GBC Under Sampled Data | Training | (1952, 29) | 97.234 | 96.199 | 96.158 | 95.186 | 96.158 |
| 1 | GBC Under Sampled Data | Validation/Test | (2026, 29) | 94.479 | 79.177 | 92.004 | 68.142 | 93.004 |
| 0 | ABC Under Sampled Data | Training | (1952, 29) | 93.340 | 93.197 | 93.186 | 93.054 | 93.186 |
| 1 | ABC Under Sampled Data | Validation/Test | (2026, 29) | 92.331 | 73.594 | 89.339 | 61.179 | 90.548 |
Logistic Regression - Recall is stable (76.5 on training, 77.0 on validation), but validation precision is poor (38.6), which pulls F1 down to 51.4.
Decision Tree Classifier - Validation recall 85.9 and accuracy 87.4 are good, but F1 drops from 87.5 to 68.6.
Bagging Classifier - Validation recall 90.2 and accuracy 88.1 are good, but F1 drops from 90.8 to 70.9.
Random Forest Classifier - Validation recall 87.1 and accuracy 84.6 are good, but F1 drops from 87.2 to 64.5.
Gradient Boosting Classifier - Best of the six: validation recall 94.5 and accuracy 92.0, with F1 dropping from 96.2 to 79.2.
AdaBoost Classifier - Validation recall 92.3 and accuracy 89.3 are good, but F1 drops from 93.2 to 73.6.
Decision Tree Classifier, Bagging Classifier and Gradient Boosting Classifier trained on the regular (unsampled) data are the best of the 18 models we tried.
We can see that the Decision Tree Classifier, Bagging Classifier and Gradient Boosting Classifier give the highest cross-validated recall of the models compared, with no sign of overfitting on accuracy, precision and F1 score.
The boxplot shows that the performance of the Decision Tree Classifier, Bagging Classifier and Gradient Boosting Classifier is consistent across folds, and their performance on the validation set is also good.
We will tune these three models and see whether the performance improves.
We will tune them using both GridSearchCV and RandomizedSearchCV, and compare the performance and the time taken by the two methods.
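As a rough illustration of how the two search methods can be compared on time and score, the sketch below runs both on a small synthetic data set. Everything in it is an assumption made for the example: the data comes from `make_classification`, the grid is deliberately tiny, and `timed_fit` is a hypothetical helper, not part of this notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Small illustrative grid: 3 * 3 * 2 = 18 candidates.
demo_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}


def timed_fit(search):
    """Fit a search object and return it with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    search.fit(X_demo, y_demo)
    return search, time.perf_counter() - start


grid_search, grid_seconds = timed_fit(
    GridSearchCV(DecisionTreeClassifier(random_state=1), demo_grid, scoring="recall", cv=3)
)
random_search, random_seconds = timed_fit(
    RandomizedSearchCV(
        DecisionTreeClassifier(random_state=1),
        demo_grid,
        n_iter=6,  # evaluate only a third of the grid
        scoring="recall",
        cv=3,
        random_state=1,
    )
)

for name, search, seconds in [
    ("grid", grid_search, grid_seconds),
    ("randomized", random_search, random_seconds),
]:
    n_candidates = len(search.cv_results_["params"])
    print(f"{name}: {n_candidates} candidates, best recall {search.best_score_:.3f}, {seconds:.2f}s")
```

The randomized search evaluates a subset of the same candidates, so its best score can match the grid search but cannot exceed it on identical folds; what it buys is a shorter run.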
# Parameter grid to pass in GridSearchCV & RandomizedSearchCV
parameters_dt = {
"max_depth": [np.arange(2, 50, 5), None],
"class_weight" : [{0: 0.20, 1: 0.80},{0: 0.30, 1: 0.70},"balanced"],
"criterion": ["gini","entropy"],
"splitter": ["best", "random"],
"max_features": [0.5,0.6,0.7,0.8,0.9,1.0,"auto", "sqrt", "log2" ],
"min_impurity_decrease":[0.0002,0.0005,0.001,0.0015,0.002,0.005,0.01],
"min_samples_leaf": [5, 10, 20, 50, 100],
"max_leaf_nodes": [np.arange(1, 20, 1), None],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
model_params = DecisionTreeClassifier()
print(f"Model supported Parameter : ")
model_params.get_params()
Model supported Parameter :
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
# model = DecisionTreeClassifier(random_state=1, class_weight={0: 0.20, 1: 0.80})
model = DecisionTreeClassifier(random_state=1)
# Calling GridSearchCV
grid_cv = GridSearchCV(
estimator=model,
param_grid=parameters_dt,
scoring=scorer,
cv=kfold,
n_jobs=-1,
verbose=2,
)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"\nBest Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
best_model_dt_grid = grid_cv.best_estimator_
# Fit the best algorithm to the data.
best_model_dt_grid.fit(X_train, y_train)
display(
model_performance_classification(
"Tuned DT - GridSearchCV", best_model_dt_grid, X_train, y_train, X_val, y_val
)
)
confusion_matrix_classification(
"Tuned DT - GridSearchCV", best_model_dt_grid, X_train, y_train, X_val, y_val
)
Fitting 10 folds for each of 15120 candidates, totalling 151200 fits
Best Parameters:{'class_weight': {0: 0.3, 1: 0.7}, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.9, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0005, 'min_samples_leaf': 5, 'splitter': 'best'}
Score: 0.920829630191624
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Tuned DT - GridSearchCV | Training | (6075, 29) | 94.877 | 87.276 | 95.556 | 80.803 | 95.281 |
| 1 | Tuned DT - GridSearchCV | Validation/Test | (2026, 29) | 88.344 | 80.785 | 93.238 | 74.419 | 91.260 |
Model 'Tuned DT - GridSearchCV' confusion matrix
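The candidate count in the search log above (15120) is worth a second look. In `parameters_dt`, `max_depth` and `max_leaf_nodes` are each written as a two-element list: one whole `np.arange(...)` array and `None`. A search treats each list element as one candidate, so the array is tried as a single (invalid) value rather than being expanded, and most likely only the `None` entries could be fitted, which is consistent with the best parameters reported. The sketch below reproduces the count in plain Python and shows what a flattened grid would look like; `range` stands in for `np.arange`, and the flattened lists are a suggestion, not what the notebook ran.

```python
from math import prod

# `range` stands in for `np.arange`; the nesting behaves the same way.
nested_max_depth = [list(range(2, 50, 5)), None]    # 2 candidates: one whole list, and None
flat_max_depth = list(range(2, 50, 5)) + [None]     # 11 candidates: 2, 7, ..., 47, None

nested_max_leaf_nodes = [list(range(1, 20)), None]  # 2 candidates
flat_max_leaf_nodes = list(range(2, 20)) + [None]   # 19 candidates (a tree needs at least 2 leaves)

# Sizes of the other candidate lists in `parameters_dt`, in the order they appear:
# class_weight, criterion, splitter, max_features, min_impurity_decrease, min_samples_leaf
other_sizes = [3, 2, 2, 9, 7, 5]

nested_total = len(nested_max_depth) * len(nested_max_leaf_nodes) * prod(other_sizes)
flat_total = len(flat_max_depth) * len(flat_max_leaf_nodes) * prod(other_sizes)

print(nested_total)  # 15120
print(flat_total)    # 790020
```

A flattened grid of this size is far too large for an exhaustive search with 10 folds, which is one practical argument for trimming the grid or relying on the randomized search.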
# model = DecisionTreeClassifier(random_state=1, class_weight={0: 0.20, 1: 0.80})
model = DecisionTreeClassifier(random_state=1)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=parameters_dt,
n_iter=100,
scoring=scorer,
random_state=1,
cv=kfold,
n_jobs=-1,
verbose=2,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
best_model_dt_rand = randomized_cv.best_estimator_
# Fit the best algorithm to the data.
best_model_dt_rand.fit(X_train, y_train)
display(
model_performance_classification(
"Tuned DT - RandomizedSearchCV",
best_model_dt_rand,
X_train,
y_train,
X_val,
y_val,
)
)
confusion_matrix_classification(
"Tuned DT - RandomizedSearchCV", best_model_dt_rand, X_train, y_train, X_val, y_val
)
Fitting 10 folds for each of 100 candidates, totalling 1000 fits
Best parameters are {'splitter': 'best', 'min_samples_leaf': 5, 'min_impurity_decrease': 0.0015, 'max_leaf_nodes': None, 'max_features': 0.5, 'max_depth': None, 'criterion': 'entropy', 'class_weight': {0: 0.3, 1: 0.7}} with CV score=0.9041923176970433:
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Tuned DT - RandomizedSearchCV | Training | (6075, 29) | 87.193 | 81.552 | 93.663 | 76.598 | 91.047 |
| 1 | Tuned DT - RandomizedSearchCV | Validation/Test | (2026, 29) | 79.141 | 72.779 | 90.474 | 67.363 | 85.894 |
Model 'Tuned DT - RandomizedSearchCV' confusion matrix
# Parameter grid to pass in GridSearchCV & RandomizedSearchCV
cl1 = DecisionTreeClassifier(class_weight={0:0.20,1:0.80},random_state=1)
parameters_bg = {
'base_estimator':[cl1],
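'max_samples': [0.7,0.8,0.9,1],  # note: the integer 1 means a single sample; 1.0 would mean 100% of samples
'max_features': [0.7,0.8,0.9,1],  # note: the integer 1 means a single feature; 1.0 would mean 100% of features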
'n_estimators': [50,100,150,200],
"bootstrap": [True, False],
"bootstrap_features": [True, False]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1) # Setting number of splits equal to 5
model_params = BaggingClassifier()
print(f"Model supported Parameter : ")
model_params.get_params()
Model supported Parameter :
{'base_estimator': None,
'bootstrap': True,
'bootstrap_features': False,
'max_features': 1.0,
'max_samples': 1.0,
'n_estimators': 10,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
# Calling GridSearchCV
grid_cv = GridSearchCV(
estimator=BaggingClassifier(random_state=1),
param_grid=parameters_bg,
scoring=scorer,
cv=kfold,
n_jobs=-1,
verbose=2,
)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"\nBest Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
best_model_bag_grid = grid_cv.best_estimator_
# Fit the best algorithm to the data.
best_model_bag_grid.fit(X_train, y_train)
display(
model_performance_classification(
"Bagging - GridSearchCV", best_model_bag_grid, X_train, y_train, X_val, y_val
)
)
confusion_matrix_classification(
"Bagging - GridSearchCV", best_model_bag_grid, X_train, y_train, X_val, y_val
)
Fitting 5 folds for each of 256 candidates, totalling 1280 fits
Best Parameters:{'base_estimator': DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1), 'bootstrap': False, 'bootstrap_features': False, 'max_features': 0.9, 'max_samples': 0.9, 'n_estimators': 200}
Score: 0.7581946624803767
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Bagging - GridSearchCV | Training | (6075, 29) | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 |
| 1 | Bagging - GridSearchCV | Validation/Test | (2026, 29) | 81.902 | 83.307 | 94.719 | 84.762 | 89.539 |
Model 'Bagging - GridSearchCV' confusion matrix
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=BaggingClassifier(random_state=1),
param_distributions=parameters_bg,
n_iter=20,
scoring=scorer,
random_state=1,
cv=kfold,
n_jobs=-1,
verbose=2,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
best_model_bag_rand = randomized_cv.best_estimator_
# Fit the best algorithm to the data.
best_model_bag_rand.fit(X_train, y_train)
display(
model_performance_classification(
"Bagging - RandomizedSearchCV",
best_model_bag_rand,
X_train,
y_train,
X_val,
y_val,
)
)
confusion_matrix_classification(
"Bagging - RandomizedSearchCV", best_model_bag_rand, X_train, y_train, X_val, y_val
)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters are {'n_estimators': 150, 'max_samples': 0.9, 'max_features': 0.9, 'bootstrap_features': False, 'bootstrap': False, 'base_estimator': DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)} with CV score=0.7520460491889063:
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Bagging - RandomizedSearchCV | Training | (6075, 29) | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 |
| 1 | Bagging - RandomizedSearchCV | Validation/Test | (2026, 29) | 81.902 | 83.438 | 94.768 | 85.032 | 89.569 |
Model 'Bagging - RandomizedSearchCV' confusion matrix
model = GradientBoostingClassifier(random_state=1)
model = model.fit(X_train, y_train)
display(
model_performance_classification(
"Gradient Boosting - Base Model", model, X_train, y_train, X_val, y_val,
)
)
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting - Base Model | Training | (6075, 29) | 82.480 | 87.595 | 96.247 | 93.387 | 90.681 |
| 1 | Gradient Boosting - Base Model | Validation/Test | (2026, 29) | 84.969 | 88.076 | 96.298 | 91.419 | 91.720 |
model_params = GradientBoostingClassifier(random_state=1)
print(f"Model supported Parameter : ")
model_params.get_params()
Model supported Parameter :
{'ccp_alpha': 0.0,
'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'deviance',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_iter_no_change': None,
'random_state': 1,
'subsample': 1.0,
'tol': 0.0001,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
learning_rates = [1, 0.50,0.1,0.09,0.05]
n_estimators = [50, 100,125]
max_depths = [3,5,7]
# Grid of parameters to choose from
parameters_gb = {
"learning_rate":learning_rates,
"n_estimators":n_estimators,
'max_depth':max_depths,
"subsample":[0.5, 0.8, 0.9, 0.95, 1.0],
}
print(f"parameters_gb : {parameters_gb}")
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.accuracy_score)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1) # Setting number of splits equal to 5
parameters_gb : {'learning_rate': [1, 0.5, 0.1, 0.09, 0.05], 'n_estimators': [50, 100, 125], 'max_depth': [3, 5, 7], 'subsample': [0.5, 0.8, 0.9, 0.95, 1.0]}
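The cost of each search follows directly from the grid printed above. The sketch below counts candidates and model fits for the exhaustive search and for a randomized search with `n_iter=50` (the value used for gradient boosting in this notebook), using only the standard library. The `coverage` figure relies on the assumption that a randomized search over plain lists draws distinct combinations, so it is the share of the grid that gets evaluated.

```python
from itertools import product

parameters_gb = {
    "learning_rate": [1, 0.5, 0.1, 0.09, 0.05],
    "n_estimators": [50, 100, 125],
    "max_depth": [3, 5, 7],
    "subsample": [0.5, 0.8, 0.9, 0.95, 1.0],
}
n_splits = 5  # folds in the StratifiedKFold used for this search
n_iter = 50   # candidates the randomized search draws

grid_candidates = len(list(product(*parameters_gb.values())))
grid_fits = grid_candidates * n_splits
random_fits = n_iter * n_splits
coverage = n_iter / grid_candidates

print(f"grid search: {grid_candidates} candidates, {grid_fits} fits")
print(f"randomized search: {n_iter} candidates, {random_fits} fits")
print(f"share of the grid sampled: {coverage:.1%}")
```

With roughly a fifth of the grid sampled, the randomized search is not guaranteed to land on the same combination as the exhaustive search; when it does, the two tuned models are identical.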
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(random_state=1)
# Run the grid search
grid_obj = GridSearchCV(
gbc_tuned,
parameters_gb,
scoring=scorer,
cv=kfold,
n_jobs=-1,
verbose=2,
)
grid_obj = grid_obj.fit(X_train, y_train)
print(
"\nBest Parameters:{} with CV score: {}".format(grid_obj.best_params_, grid_obj.best_score_)
)
# Set the clf to the best combination of parameters
gbc_tuned_grid = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned_grid.fit(X_train, y_train)
display(
model_performance_classification(
"Gradient Boosting - GridSearchCV",
gbc_tuned_grid,
X_train,
y_train,
X_val,
y_val,
)
)
confusion_matrix_classification(
"Gradient Boosting - GridSearchCV", gbc_tuned_grid, X_train, y_train, X_val, y_val
)
Fitting 5 folds for each of 225 candidates, totalling 1125 fits
Best Parameters:{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'subsample': 1.0} with CV score: 0.9565432098765433
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting - GridSearchCV | Training | (6075, 29) | 95.799 | 96.891 | 99.012 | 98.008 | 97.713 |
| 1 | Gradient Boosting - GridSearchCV | Validation/Test | (2026, 29) | 85.890 | 87.774 | 96.150 | 89.744 | 92.004 |
Model 'Gradient Boosting - GridSearchCV' confusion matrix
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(random_state=1)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
gbc_tuned,
param_distributions=parameters_gb,
n_iter=50,
scoring=scorer,
random_state=1,
cv=kfold,
n_jobs=-1,
verbose=2,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
gbc_tuned_rand = randomized_cv.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned_rand.fit(X_train, y_train)
display(
model_performance_classification(
"Gradient Boosting - RandomizedSearchCV",
gbc_tuned_rand,
X_train,
y_train,
X_val,
y_val,
)
)
confusion_matrix_classification(
"Gradient Boosting - RandomizedSearchCV",
gbc_tuned_rand,
X_train,
y_train,
X_val,
y_val,
)
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters are {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1} with CV score=0.9565432098765433:
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting - RandomizedSearchCV | Training | (6075, 29) | 95.799 | 96.891 | 99.012 | 98.008 | 97.713 |
| 1 | Gradient Boosting - RandomizedSearchCV | Validation/Test | (2026, 29) | 85.890 | 87.774 | 96.150 | 89.744 | 92.004 |
Model 'Gradient Boosting - RandomizedSearchCV' confusion matrix
Decision Tree - GridSearchCV gives better overall performance and recall than RandomizedSearchCV.
Bagging - Overfits with both GridSearchCV and RandomizedSearchCV (100% on every training metric).
Gradient Boosting - Best validation accuracy (96.2) and F1 (87.8) of the tuned models with both search methods; its validation recall (85.9) is above bagging (81.9) but below the grid-searched decision tree (88.3).
Gradient Boosting with GridSearchCV and with RandomizedSearchCV shows identical numbers on all metrics, because both searches picked the same parameter combination.
We can pick the RandomizedSearchCV model for productionizing and building the pipeline.
display(
model_performance_classification(
"Final Model - Train & Validation",
gbc_tuned_rand,
X_train,
y_train,
X_val,
y_val,
)
)
confusion_matrix_classification(
"Final Model - Train & Validation", gbc_tuned_rand, X_train, y_train, X_val, y_val,
)
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Final Model - Train & Validation | Training | (6075, 29) | 95.799 | 96.891 | 99.012 | 98.008 | 97.713 |
| 1 | Final Model - Train & Validation | Validation/Test | (2026, 29) | 85.890 | 87.774 | 96.150 | 89.744 | 92.004 |
Model 'Final Model - Train & Validation' confusion matrix
display(
model_performance_classification(
"Final Model - Train & Test", gbc_tuned_rand, X_train, y_train, X_test, y_test,
)
)
confusion_matrix_classification(
"Final Model - Train & Test", gbc_tuned_rand, X_train, y_train, X_test, y_test,
)
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Final Model - Train & Test | Training | (6075, 29) | 95.799 | 96.891 | 99.012 | 98.008 | 97.713 |
| 1 | Final Model - Train & Test | Validation/Test | (2026, 29) | 84.923 | 89.176 | 96.693 | 93.878 | 91.932 |
Model 'Final Model - Train & Test' confusion matrix
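The test-set row above can be tied back to confusion-matrix counts with a few lines of arithmetic. The counts in this sketch are inferred from the reported metrics and the 2026 test rows; they are an assumption, not values read from the confusion-matrix plot. They also suggest that the ROC-AUC column matches balanced accuracy (the mean of recall and specificity), which is what one would expect if it were computed from predicted labels rather than predicted probabilities.

```python
# Counts inferred from the reported test metrics (2026 rows); they are an
# assumption, not values read from the confusion-matrix plot.
tp, fn, fp, tn = 276, 49, 18, 1683

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

for name, value in [
    ("Recall", recall),
    ("F1-Score", f1),
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Balanced accuracy", balanced_accuracy),
]:
    print(f"{name}: {100 * value:.3f}")
```

Read this way, the final model misses 49 of 325 churners in the test set and raises 18 false alarms among 1701 existing customers.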
feature_names = X_train.columns
importances = gbc_tuned_rand.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Total transaction amount, change in total transaction count (Q4 over Q1), total revolving balance and the total number of accounts held with the bank are the top 4 features for determining whether a customer's account will end in attrition.
As we saw in the initial EDA, Gender, Marital Status, Income Level and Education Level do not contribute as much as the numerical features.
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# Copy the original data set so that later changes do not alter the original
data1 = bank_data.copy()
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# For Log Transform
def log_transform(x):
    return np.log(x + 1)
# Collect all Numerical Columns that needs imputation
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Total_Revolving_Bal",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Credit_Limit",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
log_transform_features = [
"Credit_Limit",
"Total_Revolving_Bal",
"Total_Trans_Amt",
]
# Collect all Categorical Columns that needs imputation
categorical_features = [
"Education_Level",
"Gender",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a transformer for numerical variables with data skewness Log Transformation
numeric_log_transformer = Pipeline(steps=[("log", FunctionTransformer(log_transform))])
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("log", numeric_log_transformer, log_transform_features),
("cat", categorical_transformer, categorical_features),
],
)
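One detail of the `preprocessor` above: `Credit_Limit`, `Total_Revolving_Bal` and `Total_Trans_Amt` appear in both `numerical_features` and `log_transform_features`. A `ColumnTransformer` accepts the same column in several branches and emits one output per branch, so those three columns reach the model twice, once raw and once log-transformed. That may or may not be intended. The toy sketch below (hypothetical data, two columns only) shows the effect on the output shape.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy frame reusing two column names from the data set.
toy = pd.DataFrame(
    {"Customer_Age": [45, 52, 38], "Credit_Limit": [1500.0, 12000.0, 3400.0]}
)

# Credit_Limit listed in both branches: emitted raw and log-transformed.
overlapping = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Customer_Age", "Credit_Limit"]),
        ("log", FunctionTransformer(np.log1p), ["Credit_Limit"]),
    ]
)

# Credit_Limit listed only in the log branch: emitted once.
disjoint = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Customer_Age"]),
        ("log", FunctionTransformer(np.log1p), ["Credit_Limit"]),
    ]
)

overlapping_out = overlapping.fit_transform(toy)
disjoint_out = disjoint.fit_transform(toy)

print(overlapping_out.shape)  # (3, 3)
print(disjoint_out.shape)     # (3, 2)
```

Tree-based models such as gradient boosting are largely indifferent to a duplicated, monotonically transformed feature, so the overlap is harmless here, but it would matter for a linear model.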
cols_reqd_X = [
"Customer_Age",
"Gender",
"Dependent_count",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Total_Amt_Chng_Q4_Q1",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
"Credit_Limit",
"Total_Revolving_Bal",
"Total_Trans_Amt",
]
replaceStruct = {"Attrition_Flag": {"Existing Customer": 0, "Attrited Customer": 1}}
# Separating target variable and other variables
X = data1[cols_reqd_X]
Y = data1["Attrition_Flag"].map({"Existing Customer": 0, "Attrited Customer": 1})
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7088, 16) (3039, 16)
y_train.value_counts(normalize=True)
0    0.839
1    0.161
Name: Attrition_Flag, dtype: float64
y_test.value_counts(normalize=True)
0    0.839
1    0.161
Name: Attrition_Flag, dtype: float64
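The identical class proportions in the train and test sets come from the `stratify=Y` argument. The sketch below shows the difference on toy labels with 16% positives; the arrays are hypothetical stand-ins, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with 16% positives (hypothetical data, not the bank data).
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# With stratification the class ratio is preserved in both parts.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
# Without it the ratio depends on the random draw.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1
)

print("stratified:", y_tr_strat.mean(), y_te_strat.mean())
print("unstratified:", y_tr_plain.mean(), y_te_plain.mean())
```

With a minority class this small, an unstratified split can shift the churn rate between train and test by a noticeable margin, which makes recall and precision harder to compare across the two sets.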
# Model built with the best parameters found during hyperparameter tuning
gbc_tuned = GradientBoostingClassifier(
random_state=1, subsample=1.0, n_estimators=100, max_depth=5, learning_rate=0.1
)
# Creating new pipeline with best parameters
final_pipeline_model = Pipeline(
steps=[("pre", preprocessor), ("GBC-Tuned", gbc_tuned)], verbose=True
)
# Fit the model on training data
final_pipeline_model.fit(X_train, y_train)
[Pipeline] ............... (step 1 of 2) Processing pre, total= 0.0s
[Pipeline] ......... (step 2 of 2) Processing GBC-Tuned, total= 2.7s
Pipeline(steps=[('pre',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Total_Revolving_Bal',
'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt',
'Credit_Limit',
'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio...
'Total_Revolving_Bal',
'Total_Trans_Amt']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='Unknown',
strategy='constant')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Education_Level', 'Gender',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('GBC-Tuned',
GradientBoostingClassifier(max_depth=5, random_state=1))],
verbose=True)
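The `handle_unknown="ignore"` setting in the categorical branch of the pipeline above is what lets it score rows containing a category that was not present during training. A minimal sketch of that behaviour, using a hypothetical card tier that is not in the data set:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["Blue"], ["Silver"], ["Gold"]])

known = encoder.transform([["Gold"]]).toarray()
unseen = encoder.transform([["Titanium"]]).toarray()  # hypothetical new card tier

print(encoder.categories_[0])  # categories learned during fit, sorted
print(known)   # exactly one indicator set
print(unseen)  # all zeros: the unknown category is ignored
```

An unseen category is encoded as all zeros instead of raising an error, so the model still produces a prediction, though it has no information about that category.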
display(
model_performance_classification(
"PipeLine Model - Train & Test",
final_pipeline_model,
X_train,
y_train,
X_test,
y_test,
)
)
confusion_matrix_classification(
"PipeLine Model - Train & Test",
final_pipeline_model,
X_train,
y_train,
X_test,
y_test,
)
| | Model | Data | Data Shape | Recall | F1-Score | Accuracy | Precision | ROC-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | PipeLine Model - Train & Test | Training | (7088, 16) | 94.996 | 96.092 | 98.758 | 97.215 | 97.237 |
| 1 | PipeLine Model - Train & Test | Validation/Test | (3039, 16) | 84.016 | 87.701 | 96.216 | 91.723 | 91.283 |
Model 'PipeLine Model - Train & Test' confusion matrix
Focus on data collection to avoid missing values, and collect accurate data for Income_Category, Education_Level and Marital_Status; this will help predict customer behaviour more accurately.
About 16% of customer accounts ended in attrition in the historical data. Reducing attrition will mean more active accounts and more revenue opportunities for the bank.
The bank should contact customers who have been inactive for 3-4 months to head off future attrition.
Customers with several products at the bank, such as multiple cards, might not use all of them. The bank should contact these customers and encourage them to use all their cards, or provide offers on the cards they do not use. This will help prevent future attrition on unused accounts.
93% of customers hold the Blue card, yet by income category more customers are eligible for a higher card tier. The bank should promote those card types to eligible customers.
When a customer makes fewer transactions in a given quarter, the bank should reach out and encourage them to use their card.
For customers who have been contacted and still show inactivity or few transactions, the bank should provide offers to encourage card use.
Total_Trans_Amt, Credit_Limit and Total_Revolving_Bal should be monitored; when customers approach their limit, check whether they are eligible for a credit limit increase, which would allow them to keep using their accounts actively.
Monitor each customer's transaction count and transaction amount; if there are fewer transactions than in the previous quarter, the bank should reach out to the customer and check their status.
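Two of the recommendations above (contact customers after 3-4 inactive months, and follow up when transaction counts fall quarter over quarter) can be turned into a simple screening rule. The sketch below is one possible way to do it, not a tested policy: the customer rows are made up, both thresholds are assumptions, and it assumes `Total_Ct_Chng_Q4_Q1` is the ratio of Q4 to Q1 transaction counts, as the column name suggests.

```python
import pandas as pd

# Hypothetical customer snapshot using column names from the data set.
customers = pd.DataFrame(
    {
        "CLIENTNUM": [101, 102, 103, 104],
        "Months_Inactive_12_mon": [1, 3, 4, 2],
        "Total_Ct_Chng_Q4_Q1": [0.95, 1.10, 0.40, 0.55],
    }
)

INACTIVE_MONTHS = 3   # assumption, taken from the "3-4 months" suggestion
MIN_CT_CHANGE = 0.60  # assumption: Q4 count below 60% of the Q1 count

customers["inactive_flag"] = customers["Months_Inactive_12_mon"] >= INACTIVE_MONTHS
customers["activity_drop_flag"] = customers["Total_Ct_Chng_Q4_Q1"] < MIN_CT_CHANGE
customers["contact"] = customers["inactive_flag"] | customers["activity_drop_flag"]

to_contact = customers.loc[customers["contact"], "CLIENTNUM"].tolist()
print(to_contact)
```

In practice such a rule could be combined with the model's churn predictions, so that outreach is prioritised for customers flagged by both.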